The online handwritten flowchart dataset, CASIA-OHFC, was built by the National Laboratory of Pattern Recognition (NLPR), Institute of Automation of Chinese Academy of Sciences (CASIA). CASIA-OHFC contains 2,957 diagrams created from about 600 flowchart templates of varying complexity. The diagrams were drawn by 205 writers using Huawei tablets, and each writer drew about 15 different diagrams on average, each from a different template. Each diagram contains a number of handwritten strokes, and each stroke is a sequence of points recording the (x,y) coordinates, time, pressure and the state of the pen tip. CASIA-OHFC involves 31 classes of symbols, including common graphic symbols and text. A typical template (diagram) has about 10 graphic symbols, several connecting arrows, and some instances of text (inside the graphic symbols or beside the arrows, indicating the meaning of each operation). Two types of labels are provided for each stroke: the semantic class and the instance ID of its associated symbol. The dataset is released in the Ink Markup Language (InkML) format. Figure 1 shows an example of an annotated online diagram.
Figure 1. An example of an annotated online flowchart in CASIA-OHFC. Symbol classes are denoted by different colors. Symbol IDs are omitted for a clean display.
The CASIA-OHFC dataset and the corresponding printed flowchart templates are packed in zip archives. Please use the links below to download them.
CASIA-OHFC (214MB)
flowchart templates (34.4MB)
A comprehensive description of the dataset has been published in IEEE Transactions on Multimedia (2021). Please refer to and cite X.-L. Yun, Y.-M. Zhang, F. Yin and C.-L. Liu, "Instance GNN: A Learning Framework for Joint Symbol Segmentation and Recognition in Online Handwritten Diagrams," in IEEE Transactions on Multimedia, doi: 10.1109/TMM.2021.3087000.
We collected 600 printed flowchart images from the Internet as the templates of CASIA-OHFC. The templates include text and 32 classes of graphic symbols. A full list of the symbol set can be found in Table 1.
Table 1. Flowchart Symbol Set
Although there are 32 classes of graphic symbols in the collected flowchart templates, two extremely scarce classes (i.e., ‘data storage’ and ‘sequential access storage’) are removed from the raw online handwritten data. Therefore, there are 30 graphic symbol classes in CASIA-OHFC, or 31 classes when text is counted. To prevent the writers’ drawing styles from affecting performance evaluation, the diagrams are randomly divided by writer into three subsets (a training set, a validation set and a test set) at a ratio of 7:1:2, so that no writer appears in more than one subset. With 205 writers, this gives 143/20/42 writers in the training/validation/test sets, respectively. An overview of the dataset is shown in Table 2. The file lists of the training, validation and test sets are stored in the files “CASIA-OHFC_TrainList.txt”, “CASIA-OHFC_ValidationList.txt” and “CASIA-OHFC_TestList.txt”, respectively.
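The writer-level split described above can be illustrated with a short sketch. This is not the procedure used to produce the released file lists (those are fixed by the three list files); the writer identifiers, the seed and the function name `split_writers` are hypothetical and serve only to show how a 7:1:2 split over 205 writers yields 143/20/42.

```python
import random


def split_writers(writer_ids, ratios=(0.7, 0.1, 0.2), seed=0):
    """Shuffle writers and cut them into train/validation/test groups.

    Only the first two ratios are used for the cut points; the test
    group receives whatever remains.
    """
    writers = sorted(writer_ids)              # fixed order before shuffling
    random.Random(seed).shuffle(writers)      # reproducible shuffle
    n_train = int(ratios[0] * len(writers))   # rounded down
    n_val = int(ratios[1] * len(writers))     # rounded down
    train = writers[:n_train]
    val = writers[n_train:n_train + n_val]
    test = writers[n_train + n_val:]
    return train, val, test


# Hypothetical writer identifiers; the real dataset uses writer names.
writers = [f"writer_{i:03d}" for i in range(205)]
train, val, test = split_writers(writers)
print(len(train), len(val), len(test))  # 143 20 42
```

Because the split is made over writers rather than diagrams, every diagram of a given writer lands in the same subset.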
Table 2. An Overview of CASIA-OHFC
| Dataset | #Classes | Partition | #Writers | #Templates | #Diagrams | #Strokes | #Symbols |
| --- | --- | --- | --- | --- | --- | --- | --- |
| CASIA-OHFC | 31 | Train | 143 | 600 | 2073 | 592267 | 63368 |
| | | Validation | 20 | 215 | 286 | 86139 | 8728 |
| | | Test | 42 | 385 | 598 | 171313 | 18280 |
The dataset can be used for online and offline handwritten diagram recognition, i.e., stroke classification and symbol segmentation and recognition. We find that some classes in CASIA-OHFC are very hard to recognize because of ambiguous handwriting and a lack of training samples. To make the experimental results more stable, we merge the ‘rounded rectangular’ symbol into the ‘process’ class, and combine the classes that have fewer than 90 symbol instances in the whole dataset into an ‘other’ category. Therefore, there are only 18 classes, including symbols and ‘text’, in our experiments. Note that the ‘other’ class includes 13 small classes: ‘stored data’, ‘oval callout’, ‘rectangular callout’, ‘off page connector’, ‘or’, ‘summing junction’, ‘card’, ‘internal storage’, ‘merge’, ‘extract’, ‘hard disk’, ‘annotation’ and ‘paper type’. We recommend that researchers evaluate flowchart recognition algorithms on the 18 classes, since the performance on the minority classes can be highly unstable.
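The class merging above can be expressed as a small label mapping. The sketch below is only an illustration: it uses the class names exactly as spelled in this description, whereas the label strings stored in the InkML files (for example “TEXT” or “ELLIPSE”) may be spelled differently, so the names in `OTHER_CLASSES` and the helper `merge_label` are assumptions to be adapted to the actual annotations.

```python
# The 13 small classes combined into 'other', as listed in the text above.
# Assumption: the spellings follow this description, not the InkML files.
OTHER_CLASSES = {
    "stored data", "oval callout", "rectangular callout",
    "off page connector", "or", "summing junction", "card",
    "internal storage", "merge", "extract", "hard disk",
    "annotation", "paper type",
}


def merge_label(label):
    """Map one of the 31 original class names to the 18-class setting."""
    name = label.strip().lower()
    if name == "rounded rectangular":
        return "process"        # merged into the 'process' class
    if name in OTHER_CLASSES:
        return "other"          # minority classes share one category
    return name                 # all remaining classes are kept as-is


# 31 original classes, minus 1 merged class, minus 13 small classes,
# plus the new 'other' category.
n_classes = 31 - 1 - len(OTHER_CLASSES) + 1
print(n_classes)  # 18
```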
The dataset is released in InkML format. Each file name consists of the template ID, the writer name and the additional word “revised”, separated by the character “_”, for example “123_张五_revised.inkml”. Each stroke is stored in a <trace> field with a unique ID and consists of 5 channels: x-coordinate (X), y-coordinate (Y), pressure (F), pen tip state (S) and timestamp (T). The states “1”, “0” and “2” denote pen down, pen move and pen up, respectively. A shortened InkML file is shown below:
<ink xmlns = "http://www.w3.org/TR/InkML">
  <traceFormat>
    <channel name = "X" type="decimal"/>
    <channel name = "Y" type="decimal"/>
    <channel name = "F" type="decimal"/>
    <channel name = "S" type="integer"/>
    <channel name = "T" type="decimal"/>
  </traceFormat>
  <annotation type = "UI">2018_NLPR_Flowchart</annotation>
  <annotation type = "copyright">CASIA/NLPR/PAL</annotation>
  <annotation type = "template">186</annotation>
  <annotation type = "writer">林玮泽_revised</annotation>
  <trace id = "0">
    176.90005 69.88426 0.3385442 1 20063136,
    175.90057 67.88593 0.3844651 0 20063160,
    172.90213 67.88593 0.4772838 0 20063177,
    …
    149.91411 122.84013 0.5744992 0 20063409,
    150.91359 119.84264 0.5725452 0 20063417,
    150.91359 119.84264 0.5725452 2 20063425
  </trace>
  …
  <trace id = "343">
    507.72565 244.74106 0.3634587 1 20336935,
    508.72516 245.74023 0.4191500 0 20336955,
    508.72516 248.73773 0.4997557 0 20336969,
    …
    514.72198 254.73273 0.3898388 0 20337525,
    514.72198 254.73273 0.3898388 2 20337530
  </trace>
  <traceGroups xml:id= "344">
    <annotation type = "truth">Labeled Flowchart</annotation>
    <traceGroup xml:id= "345">
      <annotation type = "truth">TEXT</annotation>
      <traceView traceDataRef= "0"/>
      <traceView traceDataRef= "1"/>
      …
      <traceView traceDataRef= "19"/>
      <traceView traceDataRef= "20"/>
      <traceView traceDataRef= "21"/>
      <annotationXML href = "TEXT_0"/>
    </traceGroup>
    <traceGroup xml:id= "346">
      <annotation type = "truth">ELLIPSE</annotation>
      <traceView traceDataRef= "22"/>
      <annotationXML href = "ELLIPSE_0"/>
    </traceGroup>
    …
  </traceGroups>
</ink>
A shortened inkml file example.
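To show how the five channels of a stroke can be read, the following is a minimal sketch using Python's standard `xml.etree.ElementTree`. It parses an abridged inline string modeled on the example above rather than a real dataset file; the helper names `local_name` and `read_traces` are hypothetical, and the code assumes that points are separated by commas and channels by whitespace, as in the example.

```python
import xml.etree.ElementTree as ET

# Abridged sample modeled on the example file (three points of trace 0).
SAMPLE = """<ink xmlns = "http://www.w3.org/TR/InkML">
<trace id = "0">
176.90005 69.88426 0.3385442 1 20063136,
175.90057 67.88593 0.3844651 0 20063160,
150.91359 119.84264 0.5725452 2 20063425
</trace>
</ink>"""


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def read_traces(root):
    """Return {trace id: [(x, y, pressure, pen_state, timestamp), ...]}."""
    traces = {}
    for elem in root.iter():
        if local_name(elem.tag) != "trace":
            continue
        points = []
        for chunk in (elem.text or "").split(","):
            fields = chunk.split()
            if len(fields) != 5:
                continue  # skip empty or incomplete chunks
            x, y, pressure = (float(v) for v in fields[:3])
            points.append((x, y, pressure, int(fields[3]), float(fields[4])))
        traces[elem.get("id")] = points
    return traces


traces = read_traces(ET.fromstring(SAMPLE))
states = [point[3] for point in traces["0"]]
print(states)  # [1, 0, 2]: pen down, pen move, pen up
```

For a file on disk, `ET.parse(path).getroot()` can be passed to `read_traces` in place of `ET.fromstring(SAMPLE)`.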
Every symbol instance is stored in a <traceGroup> field with a unique ID. The content of the <annotation> field is the category of the symbol instance, such as “TEXT” or “ELLIPSE”. The value of the attribute “traceDataRef” in each <traceView> field is the ID of a stroke that belongs to the symbol. The value of the attribute “href” in <annotationXML> gives the symbol category and the instance-level ID, separated by “_”. Note that the IDs in <annotationXML> are not always consecutive.
The online handwritten flowchart dataset, CASIA-OHFC, built by CASIA, is released for academic research free of charge under an agreement.
Commercial use of the database is subject to a charge. For a commercial license, please contact Fei Yin ( fyin@nlpr.ia.ac.cn). The database for commercial use is enlarged to contain all the online handwritten flowcharts.
The application form of the dataset for academic research can be downloaded below:
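The symbol annotations can be collected in a similar way. The sketch below, again using `xml.etree.ElementTree` on an abridged inline string modeled on the example file, gathers the category, instance ID and stroke IDs of every <traceGroup>. The helper names `local_name` and `read_symbols` and the dictionary layout are hypothetical; splitting “href” on its last “_” is an assumption that keeps working if a category name itself contains an underscore.

```python
import xml.etree.ElementTree as ET

# Abridged sample modeled on the example file (two symbol instances).
SAMPLE = """<ink xmlns = "http://www.w3.org/TR/InkML">
<traceGroups xml:id= "344">
<annotation type = "truth">Labeled Flowchart</annotation>
<traceGroup xml:id= "345">
<annotation type = "truth">TEXT</annotation>
<traceView traceDataRef= "0"/>
<traceView traceDataRef= "1"/>
<annotationXML href = "TEXT_0"/>
</traceGroup>
<traceGroup xml:id= "346">
<annotation type = "truth">ELLIPSE</annotation>
<traceView traceDataRef= "22"/>
<annotationXML href = "ELLIPSE_0"/>
</traceGroup>
</traceGroups>
</ink>"""

# ElementTree expands the predefined 'xml:' prefix to this namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def read_symbols(root):
    """Return one dict per <traceGroup>; the outer <traceGroups> is skipped."""
    symbols = []
    for group in root.iter():
        if local_name(group.tag) != "traceGroup":
            continue
        label, instance, strokes = None, None, []
        for child in group:
            name = local_name(child.tag)
            if name == "annotation":
                label = (child.text or "").strip()
            elif name == "traceView":
                strokes.append(child.get("traceDataRef"))
            elif name == "annotationXML":
                # href has the form "<category>_<instance id>"
                instance = child.get("href").rsplit("_", 1)[-1]
        symbols.append({"group_id": group.get(XML_ID), "label": label,
                        "instance": instance, "strokes": strokes})
    return symbols


symbols = read_symbols(ET.fromstring(SAMPLE))
# Per-stroke labels: stroke ID -> (semantic class, instance ID).
stroke_labels = {s: (sym["label"], sym["instance"])
                 for sym in symbols for s in sym["strokes"]}
print(stroke_labels["22"])  # ('ELLIPSE', '0')
```

The `stroke_labels` dictionary corresponds to the two per-stroke labels provided by the dataset: the semantic class and the instance ID of the symbol the stroke belongs to.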
X.-L. Yun, Y.-M. Zhang, F. Yin and C.-L. Liu, "Instance GNN: A Learning Framework for Joint Symbol Segmentation and Recognition in Online Handwritten Diagrams," in IEEE Transactions on Multimedia, doi: 10.1109/TMM.2021.3087000.
Cheng-Lin Liu ( liucl@nlpr.ia.ac.cn), Yan-Ming Zhang ( ymzhang@nlpr.ia.ac.cn)
National Laboratory of Pattern Recognition (NLPR)
Institute of Automation of Chinese Academy of Sciences
95 Zhongguancun East Road, Beijing 100190, P.R. China
Haidian | Beijing | China
Phone : (+86-10)8254-4797
Fax : (+86-10) 8254-4594
Email:liucl@nlpr.ia.ac.cn
Website:www.nlpr.ia.ac.cn/pal/